On free energy barriers in Gaussian priors and failure of cold start MCMC for high-dimensional unimodal distributions

We exhibit examples of high-dimensional unimodal posterior distributions arising in nonlinear regression models with Gaussian process priors for which Markov chain Monte Carlo (MCMC) methods can take an exponential run-time to enter the regions where the bulk of the posterior measure concentrates. Our results apply to worst-case initialized (‘cold start’) algorithms that are local in the sense that their step sizes cannot be too large on average. The counter-examples hold for general MCMC schemes based on gradient or random walk steps, and the theory is illustrated for Metropolis–Hastings adjusted methods such as preconditioned Crank–Nicolson and Metropolis-adjusted Langevin algorithm. This article is part of the theme issue ‘Bayesian inference: challenges, perspectives, and prospects’.


Introduction
Bayesian statistics in the last decades, where one attempts to generate samples from some posterior distribution Π (·|data) arising from a prior Π on D-dimensional Euclidean space and the observed data vector. MCMC methods tend to perform well in a large variety of problems, are very flexible and user-friendly, and enjoy many theoretical guarantees. Under mild assumptions, they are known to converge to their stationary 'target' distributions as a consequence of the ergodic theorem, albeit perhaps at a slow speed, requiring a large number of iterations to provide numerically accurate algorithms. When the target distribution is log-concave, MCMC algorithms are known to mix rapidly, even in high dimensions. But for general D-dimensional densities, we have only a restricted understanding of the scaling of the mixing time of Markov chains with D or with the 'informativeness' (sample size or noise level) of the data vector.
A classical source of difficulty for MCMC algorithms are multi-modal distributions. When there is a deep well in the posterior density between the starting point of an MCMC algorithm and the location where the posterior is concentrated, many MCMC algorithms are known to take an exponential time-proportional to the depth of the well-when attempting to reach the target region, even in low-dimensional settings, see figure 1a and also the discussion surrounding proposition 4.2 below. However, for distributions with a single mode and when the dimension D is fixed, MCMC methods can usually be expected to perform well.
In essence this article is an attempt to explain how, in high dimensions, wells can be formed without multi-modality of a given posterior distribution. The difficulty in this case is volumetric, also referred to as entropic: while the target region contains most of the posterior mass, its (prior) volume is so small compared to the rest of the space that an MCMC algorithm may take an exponential time to find it, see figure 1b. This competition between 'energy'-here represented by the log-likelihood N in the posterior distribution dΠ (·|data) = exp{ N + log dπ }-and 'entropy' (related to the prior term π ) has also been exploited in recent work on statistical aspects of MCMC in various high dimensional inference and statistical physics models [1][2][3][4][5]. These ideas somewhat date back to the nineteenth century foundations of statistical mechanics [6] and the notion of free energy, consisting of a sum of energetic and entropic contributions which the system spontaneously attempts to minimize. The 'MCMC-hardness' phenomenon described above is then akin to the meta-stable behaviour of thermodynamical systems, such as glasses or supercooled liquids. As the temperature decreases, such systems can undergo a 'first-order' phase transition, in which a global free energy minimum (analoguous to the  Illustration of a free-energy barrier (or free-entropy well) arising with a unimodal posterior. The model is an 'averaged' version of the spiked tensor model, with log-likelihood n (θ ) = λ θ , θ 0 3 /2 and uniform prior Π on the n-dimensional unit sphere S n−1 . θ 0 is chosen arbitrarily on S n−1 . The posterior is dΠ (θ |Y) ∝ exp{n n (θ )} dΠ (θ ), for θ ∈ S n−1 . Up to a constant, the free entropy F(r) = (1/n) log dΠ (θ |Y)δ(r − ||θ − θ 0 || 2 ) can be decomposed as the sum of n (θ ) (that only depends on r = ||θ − θ 0 || 2 ) and the 'entropic' contribution (1/n) log dΠ (θ )δ(r − ||θ − θ 0 || 2 ). In the figure we show λ = 2.1. (Online version in colour.) target region above) abruptly appears, while the system remains trapped in a suboptimal local minimum of the free energy (the starting region of the MCMC algorithm). For the system to go to thermodynamic equilibrium it must cross an extensive free energy barrier: such a crossing requires an exponentially long time, so that the system appears equilibrated on all relevant timescales, similarly to the MCMC stuck in the starting region. Classical examples include glasses and the popular experiment of rapid freezing of supercooled water (i.e. water that remained liquid at negative temperatures) after introducing a perturbation.
Inspired by recent work [4,5,7], let us illustrate some of the volumetric phenomena which are key to our results below. We separate the parameter space into three regions (see figures 1 and 2), which we name by common MCMC terminology. Firstly a starting (or initialization) region S, where an algorithm starts, secondly a target region T where both the bulk of the posterior mass and the ground truth are situated, and thirdly an intermediate free-entropy well 1 W that separates S from T . 2 In our theorems, these regions will be characterized by their Euclidean distance to the ground truth parameter θ 0 generating the data. The prior volumes of the -annuli {θ : r − < ||θ − θ 0 || 2 ≤ r}, r > 0, closer to the ground truth are smaller than those further out as illustrated in figure 1b, and in high dimensions this effect becomes quantitative in an essential way. Specifically, the trade-off between the entropic and energetic terms can happen such that the following three statements are simultaneously true.
(i) T contains 'almost all' of the posterior mass.
(ii) As one gets closer to T (and thus the ground truth θ 0 ), the log-likelihood is strictly monotonically increasing. (iii) Yet S still possesses exponentially more posterior mass than W.
Using 'bottleneck' arguments from Markov chain theory (ch. 7 in [8]), this means that an MCMC algorithm that starts in S is expected to take an exponential time to visit W. If the step size is such that it cannot 'jump over' W, this also implies an exponential hitting time lower bound for reaching T . This is illustrated in figure 2 for an averaged version of the model described in §2.
In the situation described above, the MCMC iterates never visit the region where the posterior is statistically informative, and hence yield no better inference than a random number generator. 1 As classical in statistical physics, we call free entropy the negative of the free energy. 2 In a physical system, these regions would correspond respectively to a region including a meta-stable state, a region including the globally stable state and a free energy barrier. One could regard this as a 'hardness' result about computation of posterior distributions in high dimensions by MCMC. In this work we show that such situations can occur generically and establish hitting time lower bounds for common gradient or random walk based MCMC schemes in model problems with nonlinear regression and Gaussian process priors. Before doing this, we briefly review some important results of Ben Arous et al. [4] for the problem of principle component analysis (PCA) in tensor models, from which the inspiration for our work was drawn. This technique to establish lower bounds for MCMC algorithms has also recently been leveraged in [5] in the context of sparse PCA, and in [7] to establish connections between MCMC lower bounds and the Low Degree Method for algorithmic hardness predictions (see [9] for an expository note on this technique).
When the target distribution is globally log-concave, pictures such as in figure 2 are ruled out (see also remark 4.7) and polynomial-time mixing bounds have been shown for a variety of commonly used MCMC methods. While an exhaustive discussion would be beyond the scope of this paper, we mention here the seminal works [10,11] which were among the first to demonstrate high-dimensional mixing of discretized Langevin methods (even upon 'cold-start' initializations like the ones assumed in the present paper). In concrete nonlinear regression models, polynomialtime computation guarantees were given in [12] under a general 'gradient stability' condition on the regression map which guarantees that the posterior is (with high probability) locally logconcave on a large enough region including θ 0 . While this condition can be expected to hold under natural injectivity hypotheses and was verified for an inverse problem with the Schrödinger equation in [12], for non-Abelian X-ray transforms in [13], the 'Darcy flow' model involving elliptic partial differential equations (PDE) in [14] and for generalized linear models in [15], all these results hinge on the existence of a suitable initializer of the gradient MCMC scheme used. These results form part of a larger research programme [14,[16][17][18][19] on algorithmic and statistical guarantees for Bayesian inversion methods [20] applied to problems with partial differential equations. The present article shows that the hypothesis of existence of a suitable initializer is-at least in principle-essential in these results if D/N → κ > 0, and that at most 'moderately' highdimensional (D = O(N)) MCMC implementations of Gaussian process priors may be preferable to bypass computational bottlenecks.
Our negative results apply to (worst-case initialized) Markov chains whose step sizes cannot be too large with high probability. As we show this includes many commonly used algorithms (such as preconditioned Crank-Nicolson (pCN) and Metropolis-adjusted Langevin algorithm (MALA)) whose dynamics are of a 'local' nature. There are a variety of MCMC methods developed recently, such as piece-wise deterministic Markov processes, boomerang or zig-zag samplers [21][22][23][24] which may not fall into our framework. While we are not aware of any rigorous results that would establish polynomial hitting or mixing times of these algorithms for highdimensional posterior distributions such as those exhibited here, it is of great interest to study whether our computational hardness barriers can be overcome by 'non-local' methods. There is some empirical evidence that this may be possible. For instance, in the numerical simulation of models of supercooled liquids [25], methods such as swap Monte Carlo [26] have been observed to equilibrate to low-temperature distributions which were not reachable by local approaches. Another example is given by the planted clique problem [27]: this model is conjectured to possess a large algorithmically hard phase, and local Monte Carlo methods are known to fail far from the conjectured algorithmic threshold [28][29][30]. On the other hand, non-local exchange Monte Carlo methods (such as parallel tempering [31]) have been numerically observed to perform significantly better [32].

The spiked tensor model: an illustrative example
In this section, we present (a simplified version of) results obtained mostly in [4]. First some notation. For any n ≥ 1, we denote by S n−1 = {θ ∈ R n : ||θ|| 2 = 1} the Euclidean unit sphere in n dimensions. For θ, θ ∈ R n we denote θ ⊗ θ = (θ i θ j ) 1≤i,j≤n ∈ R n 2 their tensor product.
Spiked tensor estimation is a synthetic model to study tensor PCA, and corresponds to a Gaussian Additive Model with a low-rank prior. More formally, it can be defined as follows [33].

Definition 2.1 (Spiked tensor model).
Let p ≥ 3 denote the order of the tensor. The observations Y and the parameter θ are generated according to the following joint probability distribution: Here, dY denotes the Lebesgue measure on the space (R n ) ⊗p = R n p of p-tensors of size n. Π is the uniform probability measure on S n−1 , and λ ≥ 0 is the signal-to-noise ratio (SNR) parameter. In particular, the posterior distribution Π (θ |Y) is in which Z Y is a normalization, and we defined the log-likelihood (up to additive constants) as In the following, we study the model from definition 2.1 via the prism of statistical inference. In particular, we will study the posterior Π (θ |Y) for a fixed 3 'data tensor' Y. Since such a tensor was generated according to the marginal of (2.1), we parameterize it as Y = λ √ nθ ⊗p 0 + Z, with Z a p-tensor with i.i.d. N (0, 1) coordinates, and θ 0 a 'ground truth' vector uniformly sampled in S n−1 . The goal of our inference task is to recover information on the low-rank perturbation θ ⊗p 0 (or equivalently on the vector θ 0 , possibly up to a global sign depending on the parity of p) from the posterior distribution Π (·|Y).
Crucially, we are interested in the limit of the model of definition 2.1 as n → ∞. In particular, all our statements, although sometimes non-asymptotic, are to be interpreted as n grows. We say that an event occurs 'with high probability' (w.h.p.) when its probability is 1 − O n (1). 4 Moreover, by rotation invariance, all statements are uniform over θ 0 ∈ S n−1 , so that said probabilities only refer to the noise tensor Z. Finally, throughout our discussion we will work with latitude intervals (or bands) on the sphere, with the North Pole taken to be θ 0 . We characterize them using inner products (correlations) θ, θ 0 for odd p, and | θ , θ 0 | for even p (since in this case θ 0 and −θ 0 are indistinguishable from the point of view of the observer).

(a) Posterior contraction
We can use uniform concentration of the likelihood to show that as λ → ∞ (after taking the limit n → ∞) the posterior contracts in a region infinitesimally close to the ground truth θ 0 . We first show that a region arbitrarily close to the ground truth exponentially dominates a very large starting region: (2.5) The proofs of proposition 2.3 and corollary 2.4 are given in appendix A.

Remark 2.5 (Suboptimality of uniform bounds).
Stronger than corollary 2.4, it is known that there exists a sharp threshold λ (p) such that for any λ > λ (p) the posterior mean, as well as the maximum likelihood estimator, sit w.h.p. in T s(λ) , with s(λ) > 0, while such a statement is false for λ ≤ λ (p) [34][35][36]. The λ 0 given by corollary 2.4 is, on the other hand, clearly not sharp, because of the crude uniform bound used in the proof. This can easily be understood in the p = 2 case, corresponding to rank-one matrix estimation: uniform bounds such as the ones used here would show posterior contraction for λ = ω(1), while it is known through the celebrated BBP transition that the maximum likelihood estimator is already correlated with the signal for any λ > 1 [37]. With more refined techniques from the study of random matrices and spin glass theory of statistical physics it is often possible to obtain precise constants for such relevant thresholds.

(b) Algorithmic bottleneck for MCMC
Simple volume arguments, associated with an ingenious use of Markov's inequality due to Ben Arous et al. [4] and of the rotation invariance of the noise tensor Z, allow us to get a computational hardness result for MCMC algorithms, even though the posterior contracts infinitesimally close to the ground truth as we saw in corollary 2.4. In the context of the spiked tensor model, these computational hardness results can be found in [4] (see in particular §7). We will state similar results for general nonlinear regression models in §3: in this context we will not need to use the Markov's inequality-based technique of Ben Arous et al. [4], and will solely rely on concentration arguments.
Recall that by §2(a), we can find s(λ) such that s(λ) → 1 as λ → ∞ and for all λ large enough Π (T s(λ) |Y) = 1 − O n (1). Here, we show that escaping the 'initialization' region of the MCMC algorithm is hard in a large range of λ (possibly diverging with n). In what follows, the step size of the algorithm denotes the maximal change ||x t+1 − x t || 2 allowed in any iteration. 5 We first state this bottleneck result informally.

Proposition 2.6 (MCMC bottleneck, informal). Assume that
Then any MCMC algorithm whose invariant distribution is Π (·|Y), and with a step size bounded by δ = O([nλ 2 ] −1/p ), will take an exponential time to get out of the 'initialization' region.
Note that the step size condition of proposition 2.6 is always meaningful, since our hypothesis on λ implies [nλ 2 ] −1/p = ω(n −1/2 ), and many MCMC algorithms (e.g. any procedure in which a number O(1) of coordinates of the current iterate are changed in a single iteration) will have a step size O(n −1/2 ).
Remark 2.7. The results of Ben Arous et al. [4] are stated when considering for the invariant distribution of the MCMC a more general ' Gibbs- The case we consider here is the 'Bayes-optimal' β = λ, for which More generally, λ n (p−2)/4 is conjectured to be a regime in which all polynomial-time algorithms fail to recover θ 0 [33,[38][39][40][41]. On the other hand, 'local' methods (such as gradientbased algorithms [42][43][44][45][46], message-passing iterations [35] or natural MCMC algorithms such as the ones of previous remark) are conjectured or known to fail in the larger range λ n (p−2)/2 . Proposition 2.6 shows that 'Bayes-optimal' MCMC algorithms fail for λ n (p−2)/4 . To the best of our knowledge, analysing this class of algorithms in the regime n (p−2)/4 λ n (p−2)/2 is still open.
Let us now state formally the key ingredient behind proposition 2.6. It is a rewriting of the 'free energy wells' result of Ben Arous et al. [4].
. Then for any ε > 0 small enough, there exists c, C > 0 such that for large enough n, with probability at least 1 − exp(−cn 2ε ) we have: Note that by simple volume arguments, Π (S r(ε) ) = 1 − O n (1), so that S r(ε) contains 'almost all' the mass of the uniform distribution.
One can then deduce from lemma 2.8 hitting time lower bounds for MCMCs using a folklore bottleneck argument-see Jerrum [8]-that we recall here in a simplified form (see also [5], as well as proposition 4.4, where we will detail it further along with a short proof). Proposition 2.9. We fix any Y and n, and let any 0 < s < t < 1. Let θ (0) , θ (1) , . . . be a Markov chain on S n−1 with stationary distribution Π (·|Y), and initialized from θ (0) ∼ Π S s (·|Y), the posterior distribution conditioned on S s . Let τ t = inf{k ∈ N : θ (k) ∈ T t } be the hitting time of the Markov chain onto T t . Then, for any k ≥ 1, Remark 2.10 (MCMC initialization). Note that lemma 2.8, combined with proposition 2.9, shows hardness of MCMC initialized in points drawn from Π S r( ) (·|Y). In particular, it is easy to see that this implies (via the probabilistic method) the existence of such 'hard' initializing points. While one might hope to show such negative results for more general initialization, this remains an open problem. On the other hand, Ben Arous et al. [4] shows that there exists initializers in S r( ) for which vanilla Langevin dynamics achieve non-trivial recovery of the signal even for λ = Θ n (1) (a phenomenon they call 'equatorial passes').

Main results for nonlinear regression with Gaussian priors
We now turn to the main contribution of this article, which is to exhibit some of the phenomena described in §2 in the context of nonlinear regression models. All the theorems of this section are proven in detail in §4.
from the random design regression model where G : Θ → L 2 μ (X ) is a regression map taking values in the space L 2 (X ) = L 2 μ (X ) on some bounded subset X of R d , and where the X i ∼ iid μ are drawn uniformly on X . For convenience, we assume that X has Lebesgue measure X dx = 1.
Here θ varies in some parameter space and θ 0 ∈ Θ is a 'ground truth' (we could use 'mis-specified' θ 0 and project it onto Θ). We will primarily consider the case where κ > 0 and Θ = R D , and consider high-dimensional asymptotics where D (and then also N) diverge to infinity, even though some aspects of our proofs do not rely on these assumptions. We will say that events A N hold with high probability if P N θ 0 (A N ) → 1 as N → ∞, and we will use the same terminology later when it involves the law of some Markov chain.
Let Π be a prior (Borel probability measure) on Θ so that given the data Z (N) the posterior measure is the 'Gibbs'-type distribution

(a) Hardness examples for posterior computation with Gaussian priors
We are concerned here with the question of whether one can sample from the Gibbs' measure (3.2) by MCMC algorithms. The priors will be Gaussian, so the 'source' of the difficulty will arise from the log-likelihood function N . On the one hand, recent work [10][11][12][13][14] has demonstrated that if N (θ ) is 'on average' (under E θ 0 ) log-concave, possibly only just locally near the ground truth θ 0 , then MCMC methods that are initialized into the area of log-concavity can mix towards Π (·|Z (N) ) in polynomial time even in high-dimensional (D → ∞) and 'informative' (N → ∞) settings. In the absence of such structural assumptions, however, posterior computation may be intractable, and the purpose of this section is to give some concrete examples for this with choices of G that are representative for nonlinear regression models. We will provide lower bounds on the run-time of 'worst case' initialized MCMC in settings where the average posterior surface is not globally log-concave but still unimodal. Both the loglikelihood function and posterior density exhibit linear growth towards their modes, and the average log-likelihood is locally log-concave at θ 0 . In particular, the Fisher information is well defined and non-singular at the ground truth.
The computational hardness does not arise from a local optimum ('multi-modality'), but from the difficulty MCMC encounters in 'choosing' among many high-dimensional directions when started away from the bulk of the support of the posterior measure. That such problems occur in high dimensions is related to the probabilistic structure of the prior Π , and the manifestation of 'free energy barriers' in the posterior distribution.
In many applications of Bayesian statistics, such as in machine learning or in nonlinear inverse problems with PDEs, Gaussian process priors are commonly used for inference. To connect to such situations we illustrate the key ideas that follow with two canonical examples where the prior on R D is the law where Σ α is the covariance matrix arising from the law of a D-dimensional Whittle-Matérntype Gaussian random field (see §4(d)(i) for a detailed definition). These priors represent widely popular choices in Bayesian statistical inference [47,48] and can be expected to yield consistent statistical solutions of regression problems even when D/N ≥ κ > 0, see [48,49]. In (b), we can also accommodate a further 'rescaling' (N-dependent shrinkage) of the prior similar to what has been used in recent theory for nonlinear inverse problems [ We will present our main results for the case where the ground truth is θ 0 = 0. This streamlines notation while also being the 'hardest' case for negative results, since the priors from (a) and (b) are then already centred at the correct parameter.
To formalize our results, let us define balls (3.4) centred at θ 0 = 0. We will also require the annuli for r, ε > 0 to be chosen. To connect this to the notation in the preceding sections, the sets Θ r,ε will play the role of the initialization (or starting) region S, while B s (for suitable s) corresponds to the target region T where the posterior mass concentrates. The 'intermediate' region W = Θ s,η representing the 'free-energy barrier' is constructed in the proofs of the theorems to follow. Our results hold for general Markov chains whose invariant measure equals the posterior measure (3.2), and which admit a bound on their 'typical' step sizes. As step sizes can be random, this assumption needs to be accommodated in the probabilistic framework describing the transition probabilities of the chain. Let P N (θ , A), N ∈ N, (for θ ∈ R D and Borel sets A ⊆ R D ), denote a sequence of Markov kernels describing the Markov chain dynamics employed for the computation of the posterior distribution Π (·|Z (N) ). Recall that a probability measure μ on R D is Assumption 3.1. Let P N (·, ·) be a sequence of Markov kernels satisfying the following: This assumption states that typical steps of the Markov chain are, with high probability (both under the law of the Markov chain and the randomness of the invariant 'target' measure), concentrated in an area of size η/2 around the current state θ, uniformly in a ball of radius Q around θ 0 = 0. For standard MCMC algorithms (such as pCN, MALA) whose proposal steps are based on the discretization of some continuous-time diffusion process, such conditions can be checked, as we will show in the next section.
There exists ε > 0 such that for any (sequence of) Markov kernels P N on R D and associated chains (ϑ k : k ≥ 1) that satisfy assumption 3.1 for some c 0 > 0, Q = 1 + ε, sequence η N ∈ (0, s) and all N ≥ 1 large enough, we can find an initialization point ϑ 0 ∈ Θ 2/3,ε such that with high probability (under the law of Z (N) and the Markov chain), the hitting time τ B s for ϑ k to reach B s (with s as in (iii)) is lower bounded as τ B s ≥ exp (min{c 0 , 1}N/2) .
The interpretation is that despite the posterior being strictly increasing in the radial variable ||θ || R D (at least for ||θ || R D > r, any r > 0-note that maximizers of the posterior density may deviate from the 'ground truth' θ 0 = 0 by some asymptotically vanishing error, cf. also proposition 4.1), MCMC algorithms started in Θ 2/3,ε will still take an exponential time before visiting the region B s where the posterior mass concentrates. This is true for small enough step size independently of D, N. The result holds also for ϑ 0 drawn from an absolutely continuous distribution on Θ 2/3,ε as inspection of the proof shows. Finally, we note that at the expense of more cumbersome notation, the above high probability results (and similarly in theorem 3.3) could be made non-asymptotic, in the sense that for all δ > 0 all statements hold with probability at least 1 − δ for all N ≥ N 0 (δ) large enough, where the dependency of N 0 on δ can be made explicit.
For 'ellipsoidally supported' α-regular priors (b), the idea is similar but the geometry of the problem changes as the prior now 'prefers' low-dimensional subspaces of R D , forcing the posterior closer towards the ground truth θ 0 = 0. We show that if the step size is small compared to a scaling N −b for b > 0 determined by α, then the same hardness phenomenon persists. Note that 'small' is only 'polynomially small' in N and hence algorithmic hardness does not come from exponentially small step sizes. Again, (iv) holds as well for ϑ 0 drawn from an absolutely continuous distribution on Θ N −b ,εN −b . We also note that ε depends only on α, κ, d and the choice of G but not on any other parameters. Remark 3.4. As opposed to theorem 3.2, due to the anisotropy of the prior density π , the posterior distribution is no longer radially symmetric in the preceding theorem, whence part (ii) differs from theorem 3.2. But a slightly weaker form of monotonicity of the posterior density π (·|Z (N) ) still holds: the same arguments employed to prove part (ii) of theorem 3.2 show that π (·|Z (N) ) is decreasing on {θ : ||θ || R D ≥ rN −b } (any r > 0) along the half-lines through 0, i.e.
We note that this notion precludes the possibility of π (·|Z (N) ) having extremal points outside of the region of dominant posterior mass, and implies that moving toward the origin will always increase the posterior density. As a result, many typical Metropolis-Hastings would be encouraged to accept such 'radially inward' moves, if they arise as a proposal. Thus, crucially, our exponential hitting time lower bound in part (iv) arises not through multi-modality, but merely through volumetric properties of high-dimensional Gaussian measures. suggest to move there, and future proposals will likely be outside of that bulk region, so that the chain will either exit the relevant region again or become deterministic because an accept/reject step refuses to move into such directions. In this sense, a negative result for (polynomially) small step sizes gives fundamental limitations on the ability of the chain to explore the precise characteristics of the posterior distribution. We also remark that the Lipschitz constants of ∇ (θ) are of order D or D 1+b in the preceding theorems, respectively. A Markov chain obtained from discretizing a continuous diffusion process (such as MALA discussed in the next section) will generally require step sizes that are inversely proportional to that Lipschitz constant in order to inherit the dynamics from the continuous process. For such examples, assumption 3.1 is natural. But as discussed at the end of the introduction, there exists a variety of 'non-local' MCMC algorithms for which this step size assumption may not be satisfied.
(b) Implications for common MCMC methods with 'cold-start' The preceding general hitting time bounds apply to commonly used MCMC methods in highdimensional statistics. We focus in particular on algorithms that are popular with PDE models and inverse problems, see, e.g. [50,51] and also [14] for many more references. We illustrate this for two natural examples with Metropolis-Hastings adjusted random walk and gradient algorithms. Other examples can be generated without difficulty.

(i) Preconditioned Crank-Nicolson
We first give some hardness results for the popular pCN algorithm. A dimension-free convergence analysis for pCN was given in the important paper by Hairer et al. [52] based on ideas from Hairer et al. [53]. The results in the present section show that while the mixing bounds from Hairer et al. [52] are in principle uniform in D, the implicit dependence of the constants on the conditions on the log-likelihood-function in [52] can re-introduce exponential scaling when one wants to apply the results from Hairer et al. [52] to concrete (N-dependent) posterior distributions. This confirms a conjecture about pCN made in Section 1.2.1 of Nickl & Wang [12].
Let C denote the covariance of some Gaussian prior on R D with density π . Then the pCN algorithm for sampling from some posterior density π (θ|Z (N) ) ∝ e N (θ) π (θ) is given as follows. Let (ξ k : k ≥ 1) be an i.i.d. sequence of N (0, C) random vectors. For initializer ϑ 0 ∈ R D , step size β > 0 and k ≥ 1, the MCMC chain is then given by

ACCEPT-REJECT: Set
By standard Markov chain arguments one verifies (see [52] or Ch.1 in [14]) that the (unique) invariant density of (ϑ k : k ≥ 1) equals π (·|Z (N) ). We now give a hitting time lower bound for the pCN algorithm which holds true in the regression setting for which the main theorems 3.2 and 3.3 (for generic Markov chains) were derived. In particular, we emphasize that the lower bounds to follow hold for the choice of regression 'forward' map G constructed in the proofs of theorems 3.2 and 3.3. As for the general results, we treat the two cases of C = I D /D or C = Σ α separately.

(ii) Gradient-based Langevin algorithms
We now turn to gradient-based Langevin algorithms which are based on the discretization of continuous-time diffusion processes [10,50]. A polynomial time convergence analysis for the unadjusted Langevin algorithm in the strongly log-concave case has been given in [10,11] and also in [54] for the Metropolis-adjusted case (MALA). We show here that for unimodal but not globally log-concave distributions, the MCMC scheme can take an exponential time to reach the bulk of the posterior distribution. For simplicity we focus on the Metropolis-adjusted Langevin algorithm which is defined as follows. Let (ξ k : k ≥ 1) be a sequence of i.i.d. N (0, I D ) variables, and let γ > 0 be a step size.
Again, standard Markov chain arguments show that Π (·|Z (N) ) is indeed the (unique) invariant distribution of (ϑ k : k ≥ 1). We note here that for the forward G featuring in our results to follows, ∇ log π may only be well-defined (Lebesgue-) almost everywhere on R D due to our piece-wise smooth choice of w, see (4.6) below. However, since all proposal densities involved possess a Lebesgue density, this specification almost everywhere suffices in order to propagate the Markov chain with probability 1. Alternatively one could also straightforwardly avoid this technicality by smoothing our choice of function w in (4.6), which we refrain from for notational ease. There exists some c 1 , c 2 , ε > 0 such that if the step size of (ϑ k : k ≥ 1) satisfies γ ≤ c 1 /N, then there is an initialization point ϑ 0 ∈ Θ 2/3,ε such that the hitting time τ B s = inf{k : ϑ k ∈ B s } (for B s as in (3.4 N (0, Σ α ) prior, and let G also be as in theorem 3.3.

)) satisfies with high probability (under the law of the data and of the Markov chain) as N → ∞ that τ B s ≥ exp(c 2 D). (ii) Assume the setting of theorem 3.3, with a
Then there exist some constant c 1 , c 2 , ε > 0 such that whenever γ ≤ c 1 N −1−b−2α , there is an initialization point ϑ 0 ∈ Θ N −b ,εN −b , such that the hitting time τ B s = inf{k : ϑ k ∈ B s } satisfies with high probability (under the law of the data and of the Markov chain) that τ B s ≥ exp(c 2 D).
As mentioned in remark 3.5, a bound on the step size that is inversely proportional to the Lipschitz constant of ∇ is natural for algorithms like MALA that arise from discretization of a continuous-time Markov process, see e.g. [11,54]. We emphasize again that these Lipschitz constants are D-and N-dependent, so that the required bounds on γ are not unnatural. 'Optimal' step size prescriptions for MALA [54][55][56][57] derived for Gaussian and log-concave targets or, more generally, mean-field limits (in which the posterior distribution possesses a product or mean-field structure, unlike in the models considered here) would need to be adjusted to our model classes to be comparable.

Proofs of the main theorems
We begin in §4a by constructing the family of regression maps G underlying our results from §3. Section 4b,c reduce the hitting time bounds from theorems 3.2 and 3.3 (for general Markov chains) to hitting time bounds for intermediate 'free energy barriers' that the Markov chain needs to travel through. Subsequently, theorems 3.3 and 3.2 are proved in §4d,e, respectively. Finally, the proofs for pCN (theorem 3.6) and MALA (theorem 3.7) are contained in §4f.

(a) Radially symmetric choices of G
We start with our parameterization of the map G . In our regression model and since Eε 2 = 1, We have θ 0 = 0 and by subtracting a fixed function G (0) from G (θ) if necessary we can also assume that G (θ 0 ) = 0. In this case, since vol(X ) = 1, Take a bounded continuous function w : [0, ∞] → [0, ||w|| ∞ ) with a unique minimizer w(0) = 0 and take G of the 'radial' form The assumption G(θ 0 ) = 0 implies Y i = 0 + ε i under P N θ 0 , so that we have (4.3) and the average log-likelihood is Define -annuli of Euclidean space We then also set, for any s ≥ 0, > 0, For our main theorems the map w will be monotone increasing and the preceding notation w − , w + is then not necessary, but proposition 4.2 is potentially also useful in non-monotone settings (as remarked after its proof), hence the slightly more general notation here. The choice that G is radial is convenient in the proofs, but means that the model is only identifiable up to a rotation for θ = 0. One could easily make it identifiable by more intricate choices of G , but the main point for our negative results is that the function has a unique mode at the ground truth parameter θ 0 and is identifiable there.
where T > ρ, are fixed constants to be chosen. Note that w is monotone increasing and The function w is quadratic near its minimum at the origin up until t/2, from when onwards it is piece-wise linear. In the linear regime it initially has a 'steep' ascent of gradient T until t, then grows more slowly with small gradient ρ from t until L, and from then on is constant. The function w is not C ∞ at the points r = t/2, r = t, r = L, but we can easily make it smooth by convolving with a smooth function supported in small neighbourhoods of its breakpoints r without changing the findings that follow. We abstain from this to simplify notation. The following proposition summarizes some monotonicity properties of the empirical loglikelihood function arising from the above choice of w.

(b) Bounds for posterior ratios of annuli
A key quantity in the proofs to follow will be to obtain asymptotic (N → ∞) bounds of the following functional (recalling the definition of the Euclidean annuli Θ r,ε from (4.5)), (4.8) in terms of the map w. As a side note, we remark that this functional has a long history in the statistical physics of glasses, in which it is often referred to as the Franz-Parisi potential [7,58]. (i) Suppose that for some radii 0 < s < σ , constants ε, η, ν > 0 and for all N ≥ 1 large enough, we have Then the posterior distribution Π (·|Z (N) ) from (3.2) arising in the model (3.1) satisfies that with high P N 0 -probability as N → ∞, (4.10) (ii) If in addition w is monotone increasing on [0, ∞) and if for some Q > 1 + ε, then the posterior distribution Π (·|Z (N) ) also satisfies (with high probability as N → ∞) that (4.12) for w from (4.6)). If σ > s > t, for w from (4.6), the 'likelihood' term in proposition 4.2 is

Remark 4.3 (The prior condition
so that if we also assume to control ω N , ω N in the proof that follows, then to verify (4.9) it suffices to check and sup θ∈Θ r, We can now further bound, for our G , We estimate w + (r, ) ≤w(r, ) = max(w + (r, ), 1), and noting that we can use Chebyshev's (or Bernstein's) inequality to construct an event of high probability such that the functional F N from (4.8) is bounded as and where (4.18) and this is uniform in all (r, ) since ||w|| ∞ ≤ W is bounded. Using the above with (r, ) chosen as (s, η) and (σ , ε) respectively, we then obtain with high P N θ 0 -probability. The result now follows from the hypothesis (4.9) and since the terms (Proof of part ii). The proof of part (ii) follows from an obvious modification of the previous arguments.
In the case where Π (Θ s,η ) and Π (Θ σ ,ε ) are comparable (so that the l.h.s. in (4.9) converges to zero), a local optimum at σ in the function w away from zero can verify the last inequality for

(c) Bounds for Markov chain hitting times (i) Hitting time bounds for intermediate sets Θ s,η
In (4.10), we can think of Θ σ ,ε as the 'initialization region' (further away from θ 0 ) and Θ s,η for intermediate s is the 'barrier' before we get close to θ 0 = 0. The last bound permits the following classic hitting time argument, taken from Ben Arous et al. [5], see also [8].
(ii) Reducing hitting times for B s to ones for Θ s,η We now reduce part (iv) of theorems 3.2 and 3.3, i.e. bounds on the hitting time of the region B s in which the posterior contracts, to a bound for the hitting time τ s for the annulus Θ s,η , which is controlled in proposition 4.4. To this end, in the case of theorem 3.2, we suppose that propositions 4.2 and 4.4 are verified with ν = 1σ = 2/3 some ε > 0 and Q, s, η as in the theorem, and in the case of theorem 3.3, we assume the same with choice σ = N −b and ν > 0 given after (4.27) below. For c 0 from assumption 3.1, define the events We can then estimate, using assumption 3.1, that on the frequentist event on which proposition where in the second inequality we have used that on the events A N , the Markov chain ϑ k , when started in Θ 2/3,ε , needs to pass through Θ s,η in order to reach B s .

(d) Proof of theorem 3.3
In this section, we use the results derived in the previous part of §4 to finish the proof of theorem 3.3. Parts (i) and (ii) of the theorem follow from proposition 4.1 and our choice of w in (4.6). We therefore concentrate on the proofs of part (iii) and (iv). We start with proving a key lemma on small ball estimates for truncated α-regular Gaussian priors.
(i) Small ball estimates for α-regular priors Let us first define precisely the notion of α-regular Gaussian priors. For some fixed α > d/2, the prior Π arises as the truncated law Law(θ ) of an α-regular Gaussian process with RKHS H = H α , a Sobolev space over some bounded domain/manifold X , see e.g. section 6.2.1 in [14] for details. Equivalently (under the Parseval isometry) we take a Gaussian Borel measure on the usual sequence space 2 L 2 with RKHS equal to The prior Π is the truncated law of θ D = (θ 1 , . . . , θ D ), D ∈ N.
Then if D/N κ > 0, there exist constantsc 0 > c 0 (depending on b, κ) such that for all N (≥ N 0 (z, b)) large enough: for equivalence constants in depending only on d, α. The upper bound is given in proposition 6.1.1 in [14] and a lower bound can be found as well in the literature [59] (by injecting H α (X 0 ) intoH α (X ) for some strict sub-domain X 0 ⊂ X , and using metric entropy lower bounds for the injection H α (X 0 ) → L 2 (X 0 )).
Using the results about small deviation asymptotics for Gaussian measures in Banach space [60]-specifically theorem 6.2.1 in [14] with a = 2d/(2α − d)-and assuming α > d/2, this means that the concentration function of the 'untruncated prior' satisfies the two-sided estimate Here, restricting to γ ∈ (0, 1), the two-sided equivalence constants depend only on α, d. Setting (4.23) and noting that bτ = 1, we hence obtain that for some constants c l , c u > 0, We now show that as long as D/N ≈ κ > 0, one may use the above asymptotics to derive the desired small ball probabilities for the projected prior on R D .
and hence it only remains to show the first inequality in equation (4.20). The Gaussian isoperimetric theorem (theorem 2.6.12 in [61]) and (4.24) imply that for m ≥ 4 √ c l and some c > 0, we have that (with Φ denoting the c.d.f. for N (0, 1)) (see also the proof of lemma 5.17 in [19] for a similar calculation). Then if the event in the last probability is denoted by I we have On I, if D/N → κ > 0 and by the usual tail estimate for vectors in h α , we have for some c > 0 the bound so that for any z > 0, and hence the lemma follows by appropriately choosing c 0 > 0.

Remark 4.6.
For statistical consistency proofs in nonlinear inverse problems, often rescaled Gaussian priors are used to provide additional regularization [12,13,19]. For these priors a computation analogous to the previous lemma is valid: specifically if we rescale θ by √ Nδ N , where δ N = N −α/(2α+d) so that √ Nδ N = N (d/2)/(2α+d) = N k , then we just take N −β+k = N −b in the above small ball computation, that is −b = −β + k or b = β − k, and the same bounds (as well as the proof to follow) apply.
(ii) Proof of theorem 3.3, part (iv) Lemma 4.5 and the hypotheses on η immediately imply To lower bound Π (Θ N −b ,εN −b ), we choose ε large enough such that which implies for all N large enough that Now, for w from (4.6), we set for T b to be chosen and ρ, L, s b , t b fixed constants, so that ||w|| ∞ is bounded (uniformly in N) by a constant which depends only on T b , L, ρ, whence (4.14) holds. Now the key inequality (4.15) with s = s b N −b and with our choice of η, ε, σ = N −b will be satisfied if We define ν to equal to 1/3 of the l.h.s. so that (4.27) will follow for the given s b , κ, α, d by choosing ε large enough and whenever N is large enough. Finally, let us note that with Q = C √ N for some C ≥ 2E[||θ|| 2 ], where θ is the infinite Gaussian vector with RKHS h α , we can deduce from theorem 2.1.20 and exercise 2.1.5 in [61] that Thus, using also (4.25), choosing C large enough verifies (4.11). Since (4.25) and the a.s. boundedness of sup θ | N (θ )| for N from (4.3) imply that Π (Θ N −b ,εN −b |Z (N) ) > 0 a.s., proposition 4.2 and then also proposition 4.4 apply for this prior, and the arguments from §4c(ii) yield the desired result.

(iii) Proof of theorem 3.3, part (iii)
We finish the proof of the theorem by showing point (iii) . We use the setting and choices from the previous section. Let us write G(A) = A e N (θ) dΠ (θ) for any measurable set A. Recall the notation B r = {θ : ||θ || R D ≤ r}, r > 0. Repeating the argument leading to (4.17) with B t/2 in place of Θ r, , and using lemma 4.5, we have with high probability (1) . Likewise, we also have (1). We can assume that G(B c s ) > 0. Hence, since Π (B c s ) → 1 in view of lemma 4.5, Now, for t b < s b fixed we can choose T b large enough such that the last quantity exceeds 1 with high probability (in particular this retrospectively justifies the last O(1) as then ||w|| ∞ = O(1) for our choice of T b ). Therefore, again with high probability For M t,s = {θ : t/2 < ||θ || R D ≤ s} this further implies that with high probability and then,

Remark 4.7.
If the map w is globally convex, say w(s) = Ts 2 /2 for all s > 0, then a 'large enough' choice of ε after (4.27) is not possible. It is here where global log-concavity of the likelihood function helps, as it enforces a certain 'uniform' spread of the posterior across its support via a global coercivity constant T. By contrast the above example of w is not convex, rather it is very spiked on (0, t/2) and then 'flattens out'.
(e) Proof of theorem 3.2 The proof of theorem 3.2 proceeds along the same lines as that of theorem 3.3, with scaling t, L, ρ, s, η constant in N, corresponding to b = 0 in N −b , and replacing the volumetric lemma 4.5 by the following basic result.
(ii) Proofs for MALA Theorem 3.7 is proved by verifying the hypotheses of theorems 3.2 and 3.3, respectively. A key difference between pCN and MALA is that the proposal kernels for MALA, not just its acceptance probabilities, depend on the data Z (N) itself. Again, we begin by examining part (ii) which regards N(0, Σ α ) priors.
Using this, (4.34) and choosing c > 0 small enough, conditional on the event A the probability Pr(·) under the Markov chain satisfies where the last inequality is proved as in (4.31) above, using theorem 2.5.8 in [61]. Thus, assumption 3.1 is satisfied with c 0 = 1 and the proof is complete.
Then letting s ∈ (0, 1/3) and Q > 0 be as in theorem 3.2, and fixing an arbitrary η ∈ (0, s/2), the above implies that for sufficiently small constant c > 0 and for any γ ≤ c/N, it holds that Thus, choosing c > 0 small enough and arguing exactly as in the last step of the proof of theorem 3.6, part (i), assumption 3.1 is satisfied with c 0 = 1 and the proof is complete.